A study of vectorization methods for unstructured text documents in natural language according to their influence on the quality of work of various classifiers
Abstract
The growing volume of natural-language text processed at critical information infrastructure facilities creates the problem of classifying that information by its degree of confidentiality. Success in solving this problem depends both on the classifier model itself and on the chosen method of feature extraction (vectorization): the properties of the source text, including the full set of demarcating features, must be conveyed to the classifier model as completely as possible. The paper presents an empirical assessment of the effectiveness of linear classification algorithms as a function of the chosen vectorization method and, in the case of the Hash Vectorizer, of the number of configurable parameters. State text documents, conditionally treated as confidential, are used as the dataset for training and testing the classification algorithms. This text corpus was chosen because specific terminology occurs throughout declassified documents. Terminology, being a primitive demarcation boundary and acting as a classification feature, eases the work of the classification algorithms, which in turn makes it possible to focus on the contribution made by the chosen vectorization method. The quality metric is the classification error, defined as the complement of the proportion of correct answers (accuracy), that is, error = 1 - accuracy. The algorithms were also evaluated by training time. The resulting histograms show the classification error and the training time of each algorithm. The most and least effective algorithms for a given vectorization method are identified. The results make it possible to solve practical classification problems more efficiently for small text documents characterized by specific terminology.
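The evaluation loop described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' setup: it assumes a hand-written hashing vectorizer whose single configurable parameter is the number of hash buckets (`n_features`, a name chosen here), a plain perceptron standing in for the linear classifiers, and a tiny made-up corpus. The names `hash_vectorize`, `train_perceptron`, `error_rate`, `evaluate`, `TRAIN` and `TEST` are all hypothetical.

```python
import time
import zlib


def hash_vectorize(text, n_features):
    """Map a text to a fixed-length count vector using the hashing trick."""
    vec = [0.0] * n_features
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(token.encode("utf-8")) % n_features] += 1.0
    return vec


def train_perceptron(X, y, epochs=20):
    """Train a plain perceptron (a simple linear classifier); labels are +1/-1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified: shift the boundary toward xi
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w, b


def predict(model, x):
    """Return +1 or -1 depending on the side of the linear boundary."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


def error_rate(model, X, y):
    """Classification error, computed as 1 - accuracy."""
    correct = sum(predict(model, xi) == yi for xi, yi in zip(X, y))
    return 1.0 - correct / len(y)


def evaluate(train, test, n_features):
    """Return (error, training_time_seconds) for a given hash-space size."""
    X_train = [hash_vectorize(text, n_features) for text, _ in train]
    y_train = [label for _, label in train]
    X_test = [hash_vectorize(text, n_features) for text, _ in test]
    y_test = [label for _, label in test]
    start = time.perf_counter()
    model = train_perceptron(X_train, y_train)
    elapsed = time.perf_counter() - start
    return error_rate(model, X_test, y_test), elapsed


# Hypothetical toy corpus (not the paper's data): +1 = "confidential", -1 = "open".
TRAIN = [
    ("directive on classified operational planning", 1),
    ("memorandum regarding classified intelligence sources", 1),
    ("recipe for apple pie with cinnamon", -1),
    ("weekend weather forecast and travel tips", -1),
]
TEST = [
    ("classified memorandum on operational sources", 1),
    ("apple pie and weekend travel", -1),
]

if __name__ == "__main__":
    # Vary the only configurable parameter of the hashing vectorizer.
    for n in (2, 16, 1024):
        err, secs = evaluate(TRAIN, TEST, n)
        print(f"n_features={n:5d} error={err:.2f} train_time={secs:.6f}s")
```

With very few buckets, unrelated tokens collide and the classes become harder to separate; a larger hash space reduces collisions at the cost of longer vectors. The numbers printed for this toy corpus say nothing about the paper's results.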
Articles in current issue
- A study of a silicone film deposited on quartz glass under laser radiation
- Optical composites based on organic polymers and semiconductor pigments
- A new algorithm for the identification of sinusoidal signal frequency with constant parameters
- A study of silicon p-n structures with mono and multifacial photosensitive surfaces
- Detection of yawning in driver behavior based on a convolutional neural network
- A Game Theory approach for communication security and safety assurance in cyber-physical systems with Reputation and Trust-based mechanisms
- A study of the influence of human factors on the speed of urban rail transport
- An algorithm for detecting RFID-duplicates
- Reduction of LSB detectors set with definite reliability
- Classification of objects in images with distortions based on a two-stage topological analysis
- Dimensionality reduction of the attributes using fuzzy optimized independent component analysis for a Big Data Intrusion Detection System
- An optimal swift key generation and distribution for QKD
- Recognition of the emotional state based on a convolutional neural network
- Intellectualization of personnel development management in high-tech service-oriented companies
- A study of the efficiency of the magnetic compass correction system
- A new analytical model of drain current and small signal parameters for AlGaN-GaN high-electron-mobility transistors
- Imputation and system modeling of acid-base state parameters for different groups of patients
- Construction of movement trajectories for objects based on the Dubins car problem, taking into account constant external influences
- A mathematical model of an epidemic with an arbitrary law of recovery
- Simulation of the pulsed outflow of air and fine powder mixture, partially filling the discharge channel
- Vectorized numerical algorithms for the solution of continuum mechanics problems
- A comparative analysis of computational intelligence algorithms for estimation of LTE channels
- Implementation of a clinical decision support system to improve the medical data quality for hypertensive patients